Design of a high-performance GEMM-like Tensor-Tensor Multiplication

Authors

Paul Springer

Paolo Bientinesi

Abstract

We present "GEMM-like Tensor-Tensor multiplication" (GETT), a novel approach to tensor contractions that mirrors the design of a high-performance general matrix-matrix multiplication (GEMM). The critical insight behind GETT is the identification of three index sets, involved in the tensor contraction, which enable us to systematically reduce an arbitrary tensor contraction to loops around a highly tuned "macro-kernel". This macro-kernel operates on suitably prepared ("packed") sub-tensors that reside in a specified level of the cache hierarchy. In contrast to previous approaches to tensor contractions, GETT exhibits desirable features such as unit-stride memory accesses, cache-awareness, and full vectorization, without requiring auxiliary memory. To compare our technique with other modern tensor contractions, we integrate GETT alongside the so-called Transpose-Transpose-GEMM-Transpose and Loops-over-GEMM approaches into an open-source "Tensor Contraction Code Generator" (TCCG). The performance results for a wide range of tensor contractions suggest that GETT has the potential of becoming the method of choice: while GETT exhibits excellent performance across the board, its effectiveness for bandwidth-bound tensor contractions is especially impressive, outperforming existing approaches by up to 12.3×. More precisely, GETT achieves speedups of up to 1.42× over an equivalent-sized GEMM for bandwidth-bound tensor contractions while attaining up to 91.3% of peak floating-point performance for compute-bound tensor contractions.
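As a rough illustration of the three index sets mentioned in the abstract, the sketch below assumes they are the free indices of A, the free indices of B, and the contracted indices, and derives the shape of the "equivalent-sized GEMM" from them. The function names, the single-letter index notation, and the example extents are hypothetical and not taken from GETT or TCCG.

```python
from math import prod


def classify_indices(idx_a, idx_b, idx_c):
    """Split the indices of C[idx_c] = A[idx_a] * B[idx_b] into three sets.

    Assumption: the sets are (1) free indices of A, (2) free indices of B,
    and (3) contracted indices shared by A and B but absent from C.
    """
    set_a, set_b, set_c = set(idx_a), set(idx_b), set(idx_c)
    free_a = [i for i in idx_c if i in set_a and i not in set_b]
    free_b = [i for i in idx_c if i in set_b and i not in set_a]
    contracted = [i for i in idx_a if i in set_b and i not in set_c]
    return free_a, free_b, contracted


def equivalent_gemm_shape(idx_a, idx_b, idx_c, extents):
    """Return (m, n, k) of a GEMM with the same operation count."""
    free_a, free_b, contracted = classify_indices(idx_a, idx_b, idx_c)
    m = prod(extents[i] for i in free_a)
    n = prod(extents[i] for i in free_b)
    k = prod(extents[i] for i in contracted)
    return m, n, k


# Hypothetical contraction C[a,b,c] = sum over k of A[c,k,a] * B[k,b]
extents = {"a": 24, "b": 16, "c": 8, "k": 32}
sets = classify_indices("cka", "kb", "abc")
shape = equivalent_gemm_shape("cka", "kb", "abc", extents)
print(sets, shape)
```

Under these assumptions, the contraction performs the same number of multiply-adds as an m × n × k matrix product, which is the sense in which a contraction can be compared against a GEMM of equal size.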

Related resources

Strassen's Algorithm for Tensor Contraction

Tensor contraction (TC) is an important computational kernel widely used in numerous applications. It is a multi-dimensional generalization of matrix multiplication (GEMM). While Strassen’s algorithm for GEMM is well studied in theory and practice, extending it to accelerate TC has not been previously pursued. Thus, we believe this to be the first paper to demonstrate how one can in practice sp...

NVIDIA Tensor Core Programmability, Performance & Precision

The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called Tensor Core, that performs one matrix-multiply-and-accumulate on 4×4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to pro...

Assessment of the Log-Euclidean Metric Performance in Diffusion Tensor Image Segmentation

Introduction: Appropriate definition of the distance measure between diffusion tensors has a deep impact on Diffusion Tensor Image (DTI) segmentation results. The geodesic metric is the best distance measure since it yields high-quality segmentation results. However, the important problem with the geodesic metric is a high computational cost of the algorithms based on it. The main goal of this ...

High Performance Rearrangement and Multiplication Routines for Sparse Tensor Arithmetic

Researchers from diverse disciplines are increasingly incorporating numeric high-order data, i.e., numeric tensors, within their practice. Just like the matrix-vector (MV) paradigm, the development of multi-purpose, but high-performance, sparse data structures and algorithms for arithmetic calculations, e.g., those found in Einstein-like notation, is crucial for the continued adoption of tensors...

High-Performance Tensor Contraction without BLAS

Tensors are an integral part of many scientific disciplines [22], [16], [2], [12], [11]. At their most basic, tensors are simply a multidimensional collection of data (or a multidimensional array, as expressed in many programming languages). In other cases, tensors represent multidimensional transformations, extending the theory of vectors and matrices. The logic of handling, transforming, and ...

Journal:

CoRR

Volume: abs/1607.00145

Publication date: 2016
